ABSTRACT



Utterance-based mean removal in the log domain, or in any
linear transformation of the log domain, e.g., the cepstral
domain, is known to substantially improve a recognizer's
robustness to transducer difference, channel distortion, and
speaker variation. Applicants teach a sequential determination
of the utterance log-spectral mean by a generalized maximum a
posteriori estimation. The solution is generalized to a weighted
sum of the prior mean and the mean estimated from available
frames, where the weights are a function of the number of
available frames.

5 Claims, 2 Drawing Sheets 
SEQUENTIAL DETERMINATION OF
UTTERANCE LOG-SPECTRAL MEAN BY 
MAXIMUM A POSTERIORI PROBABILITY 
ESTIMATION 

This application claims priority under 35 USC §119(e)(1)
of provisional application number 60/083,926, filed May
1, 1998.



where $\bar{c}_n$ is an estimate of the mean up to frame n and
$\alpha$ is a weighting value between zero and one.

Among the choices for the initial mean $\bar{c}_0$ and weighting
factor $\alpha$, the prior art discusses two cases.

The first is the cumulative mean removal case where

$$\bar{c}_0 = 0 \quad \text{and} \quad \alpha = \frac{n-1}{n} \tag{4}$$



FIELD OF THE INVENTION

This invention relates to speech recognition and more
particularly to determination of an utterance recognition
parameter.

BACKGROUND OF THE INVENTION 

Referring to FIG. 1 there is illustrated a block diagram of
a speech recognition system comprising a source 13 of
Hidden Markov Models (HMM) and input speech applied to
a recognizer 11. The result is recognized speech such as text.
One of the sources of degradation for speech recognition of
the input speech is the distortion due to transducer
difference, channel, and speaker variability. Because this
distortion is assumed to be additive in the log domain,
utterance-based mean normalization in the log domain (or in
any linear transformation of the log domain, for example, the
cepstral domain) has been proposed to improve recognizers'
robustness. See, for example, S. Furui, "Cepstral Analysis
Technique for Automatic Speaker Verification," IEEE Trans.
Acoust., Speech and Signal Processing, ASSP-29(2):
254-272, 1981. Due to its computational simplicity and
substantial improvement in results, such mean normalization
has become a standard processing technique for most
recognizers.

To do such normalization, the utterance log-spectral mean
must be computed over all N frames:

$$\bar{c}_N = \frac{1}{N}\sum_{n=1}^{N} c_n \tag{1}$$



where $c_n$ is the n-th log-spectral vector. The log-spectral
vectors are produced by sampling the incoming speech,
taking a block or window of samples, performing a discrete
Fourier transform on these samples, and taking the
logarithm of the transform output.
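
The front-end steps just described can be sketched as follows. This is only an illustrative sketch, not the applicants' implementation: the function name, the non-overlapping framing, the Hamming window, and the small constant added before the logarithm are all assumptions made here. The default frame length of 160 samples corresponds to 20 ms at an 8 kHz sampling rate.

```python
import numpy as np

def log_spectral_vectors(signal, frame_len=160, eps=1e-10):
    """Return one log-magnitude spectrum per frame (hypothetical helper).

    Assumptions: non-overlapping frames, Hamming window, and `eps`
    added to avoid log(0). None of these choices come from the text.
    """
    n_frames = len(signal) // frame_len
    # Take blocks (windows) of samples.
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    windowed = frames * np.hamming(frame_len)
    # Discrete Fourier transform of each block, then the logarithm.
    spectrum = np.abs(np.fft.rfft(windowed, axis=1))
    return np.log(spectrum + eps)
```

Each row of the returned array plays the role of one log-spectral vector $c_n$.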

The technique is not suitable for on-line real-time operation
because, due to the requirement of the utterance mean,
the normalized vectors cannot be produced until the whole
utterance has been observed. In Equation 1, $\bar{c}_N$ is the
log-spectral vector averaged over N windows. Since N
covers all frames of the utterance, the application to
real-time systems is limited.
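
A minimal sketch of utterance-based mean normalization (Equation 1) is given below, assuming the whole utterance is already available as an array with one log-spectral vector per row. The function name is hypothetical. The sketch also makes the motivation concrete: a distortion that is additive and constant in the log domain shifts every vector by the same offset, so it cancels when the utterance mean is subtracted.

```python
import numpy as np

def utterance_mean_normalize(vectors):
    """Subtract the utterance mean (Equation 1) from every frame.

    Needs all N frames up front, which is why it cannot run frame by frame.
    """
    vectors = np.asarray(vectors, dtype=float)
    mean = vectors.mean(axis=0)          # utterance log-spectral mean
    return vectors - mean, mean

# Illustration with synthetic data: a constant additive offset h
# (standing in for a channel) has no effect on the normalized vectors.
rng = np.random.default_rng(0)
clean = rng.normal(size=(30, 4))
h = np.array([1.0, -2.0, 0.5, 3.0])
norm_clean, _ = utterance_mean_normalize(clean)
norm_distorted, _ = utterance_mean_normalize(clean + h)
```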

To solve this problem, sequential estimation of the mean
vector with exponential smoothing techniques has been
disclosed. See M. G. Rahim and B. H. Juang, "Signal Bias
Removal by Maximum Likelihood Estimation for Robust
Telephone Speech Recognition," IEEE Trans. on Speech
and Audio Processing, 4(1): 19-30, Jan. 1996. The sequential
determination is that as more vectors are received, better
and better estimates are obtained, as follows:

$$\bar{c}_n = \alpha \cdot \bar{c}_{n-1}\ \text{(past estimate)} + (1-\alpha) \cdot c_n\ \text{(current input vector)} \tag{2}$$

and the mean-subtracted vector:

$$\hat{c}_n = c_n - \bar{c}_n \tag{3}$$
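
The recursion of Equations 2 and 3 can be sketched as below. The function name and the choice to pass the weighting value as a callable of the frame index n (1-based) are assumptions made for illustration; passing a callable lets the same sketch cover a constant weight as well as a weight that depends on n, such as (n-1)/n.

```python
import numpy as np

def sequential_mean_removal(vectors, alpha, initial_mean):
    """Apply Equations 2 and 3 frame by frame (illustrative sketch).

    alpha: callable mapping the 1-based frame index n to the weight
           placed on the past estimate.
    initial_mean: starting value of the mean estimate.
    """
    mean = np.array(initial_mean, dtype=float)
    out = []
    for n, c in enumerate(np.asarray(vectors, dtype=float), start=1):
        a = alpha(n)
        mean = a * mean + (1.0 - a) * c     # Equation 2
        out.append(c - mean)                # Equation 3
    return np.array(out), mean
```

With `alpha=lambda n: (n - 1) / n` and a zero initial mean, the final estimate equals the plain average of all frames seen so far; with a constant such as `lambda n: 0.98` it behaves as exponential smoothing.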



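For the cumulative mean removal case of Equation 4,
Equation 2 reduces to

$$\bar{c}_n = \frac{1}{n}\sum_{i=1}^{n} c_i \tag{5}$$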

In this case, at time n, the mean vector is approximated by
the mean of all vectors observed up to time n. For large n,
Equation 5 gives a mean that is very close to the true
utterance mean, i.e., it converges to the utterance mean in
Equation 1. On the other hand, when $\bar{c}_0 = 0$, no prior
knowledge of the mean is used, which will make the mean
unreliable for short utterances. The second case is called
exponential smoothing. The second case sets

$$\bar{c}_0 = \text{mean vector over training data, and } 0 < \alpha < 1 \tag{6}$$

Rearranging Equation 2, we get

$$\bar{c}_n = \alpha^{n} \cdot \bar{c}_0 + (1-\alpha)\sum_{i=1}^{n} \alpha^{n-i} \cdot c_i \tag{7}$$

The second term of Equation 7 is a weighted sum of all
vectors observed up to time n. Due to the exponential decay
of the weights $\alpha^{n-i}$, only the immediate past observed
vectors are dominant contributors to the sum, while the more
distant past vectors contribute very little. Consequently, for
large n the mean given by Equation 7 will not usually be
close to the true utterance mean, i.e., asymptotically,
exponential smoothing does not give the utterance mean.
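
As a sanity check on the rearrangement, the sketch below compares the frame-by-frame recursion of Equation 2 with the closed form of Equation 7 on synthetic data. The function names and the test data are hypothetical. The explicit weight vector also makes the exponential decay visible: the weight on frame i is $(1-\alpha)\alpha^{n-i}$, which shrinks rapidly as i moves away from n.

```python
import numpy as np

def smoothed_mean_recursive(vectors, alpha, c0):
    """Equation 2 applied frame by frame with a constant alpha."""
    mean = np.array(c0, dtype=float)
    for c in np.asarray(vectors, dtype=float):
        mean = alpha * mean + (1.0 - alpha) * c
    return mean

def smoothed_mean_closed_form(vectors, alpha, c0):
    """Equation 7: decayed prior term plus exponentially weighted sum."""
    vectors = np.asarray(vectors, dtype=float)
    n = len(vectors)
    # Weight on frame i (i = 1..n) is (1 - alpha) * alpha**(n - i).
    weights = (1.0 - alpha) * alpha ** (n - np.arange(1, n + 1))
    return alpha ** n * np.asarray(c0, dtype=float) + weights @ vectors

rng = np.random.default_rng(2)
frames = rng.normal(size=(200, 3))
prior = np.ones(3)
recursive = smoothed_mean_recursive(frames, 0.98, prior)
closed = smoothed_mean_closed_form(frames, 0.98, prior)
```

The two results agree to floating-point precision, while in general neither equals the plain average of the 200 frames.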

SUMMARY OF THE INVENTION 

In accordance with one embodiment of the present invention
an estimate of the utterance mean is determined by
maximum a posteriori probability (MAP) estimation. This
MAP estimate is subtracted from the log-spectral vector of
the incoming signal to be applied to a speech recognizer in
a speech recognition system.

DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a prior art recognizer system; 

FIG. 2 is a chart illustrating typical $\alpha(n)$ values as a
function of frame number for linear (Z=2300 ms, linear
2300) and exponential ($\gamma$=0.985, exp=0.985) decaying;

FIG. 3 illustrates a block diagram of the system according 
to one embodiment of the present invention; 

FIG. 4 illustrates word error rates as functions of ALPHA
($\alpha$) for sequential cepstral mean subtraction (Equation 2);

FIG. 5 illustrates word recognition rates as functions of
RHO ($\rho$) for MAP cepstral mean estimation (Equation 11);

FIG. 6 illustrates word recognition error rates as functions
of GAMMA ($\gamma$) for exponential weights (Equation 15); and

FIG. 7 illustrates word recognition error rates as functions 
of Z for linear weights (Equation 14). 

DESCRIPTION OF PREFERRED 
EMBODIMENTS OF THE PRESENT 
INVENTION 

According to one embodiment of a speech recognition
system, a mean estimator should meet the following
requirements:
It should allow integration of prior knowledge on the mean.

The estimate should be asymptotically the utterance mean,
i.e., approach the utterance mean as the number of observed
frames becomes large.

It has to be sequential, and computationally efficient.

In accordance with one embodiment of the present
invention, an estimate of the utterance mean is achieved by
maximum a posteriori probability (MAP) estimation. MAP
estimation allows optimal combination of newly acquired
data and existing knowledge, through incorporation of prior
information in the estimation of a parameter by assuming a
prior distribution of it.

It is assumed that:

The utterance mean m is a Gaussian random variable (R.V.)
with mean $\lambda$ and variance $\sigma^2$.

$\sigma^2$ is fixed and known.

$\lambda$ in turn is a random variable with a prior distribution
$p_0(\lambda)$.

For MAP estimation, a prior distribution that imposes
constraints on the values of $\lambda$ must be chosen. We use
conjugate priors for their mathematical attraction (see M. H.
DeGroot, Optimal Statistical Decisions, New York:
McGraw-Hill, 1970) and popularity for similar tasks (see J. L.
Gauvain and C. H. Lee, "Maximum A Posteriori Estimation
for Multivariate Gaussian Observations of Markov Chains,"
IEEE Trans. on Speech and Audio Processing, 2(2): 291-298,
April 1994). A conjugate prior for a R.V. is the prior
distribution for the parameter $\lambda$ of the pdf of the R.V.,
such that the posterior distribution $p(\lambda|X)$ and prior
distribution $p_0(\lambda)$ belong to the same distribution family
for any sample size and any value of observation X. The
conjugate prior for the mean of a Gaussian density is known
to be a Gaussian density:

$$p_0(\lambda) = N(\lambda;\ m_0, \sigma_0^2) \tag{8}$$

The MAP estimation of m has been extensively studied and
the estimate is given by (see R. O. Duda and P. E. Hart,
Pattern Classification and Scene Analysis, John Wiley &
Sons, New York, 1973):

$$m_{MAP}(n) = \frac{\sigma^2}{\sigma^2 + n\sigma_0^2}\, m_0 + \frac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}\, m_n \tag{9}$$

where $m_n$, given in Equation 5, is the ML estimate of the
utterance mean from the data observed up to time n. Denote

$$\rho \triangleq \frac{\sigma^2}{\sigma_0^2} \tag{10}$$

Equation 9 becomes:

$$m_{MAP}(n) = \frac{\rho}{\rho + n}\, m_0 + \frac{n}{\rho + n}\, m_n \tag{11}$$

We point out that:

If no sample is available (n=0) or the prior mean is known
with certainty ($\rho = \infty$), then the MAP estimate of the
mean is the prior mean $m_0$.

If the sample size is very large ($n = \infty$) or the prior mean
is known with low certainty ($\rho = 0$), then the MAP estimate
of the mean is the ML estimate of the mean.

In practice, obtaining a reliable estimate of the variance
$\sigma_0^2$ is difficult because of the unavailability of training
data covering all potential testing environments. In addition,
our recognition system is expected to work even in unknown
environments. We therefore choose to adjust $\rho$ by
experiments. Denote

$$\alpha(n) \triangleq \frac{\rho}{\rho + n} \tag{12}$$

Equation 11 can be written

$$m_{MAP}(n) = \alpha(n)\, m_0 + (1 - \alpha(n))\, m_n \tag{13}$$

Equation 13 is a generalization of Equation 7 to a more
general functional form for $\alpha(n)$. Typically, $\alpha(n)$ is any
decreasing function of the number of available frames. It is
expected that such generalization could help to compensate
for the inaccuracy introduced by the assumptions made. Here
we study two variants of $\alpha(n)$.

We can choose a piece-wise linear decay for $\alpha(n)$:

$$\alpha(n) = \begin{cases} 1, & \text{if } n = 0; \\ \max\left(\alpha(n-1) - \dfrac{D}{Z},\ 0\right), & \text{otherwise,} \end{cases} \tag{14}$$

where D is the time interval between two frames (frame rate)
and Z is the frame where $\alpha(n)$ goes to (and stays at) 0.

Another possibility is $\alpha(n)$ exponentially decaying:

$$\alpha(n) = \begin{cases} 1, & \text{if } n = 0; \\ \alpha(n-1) \times \gamma, & \text{otherwise,} \end{cases} \tag{15}$$

where $0 < \gamma < 1$ controls the rate of exponential decay.

FIG. 2 shows two typical linear and exponential decays
for a 20 ms frame rate.

Referring to FIG. 3 there is illustrated the recognizer
according to one embodiment of the present invention. As in
FIG. 1, there is a recognizer 11 and the source 13 of HMM
models, and the input speech is preprocessed before being
applied to the recognizer 11. The mean over the training data
$m_0$ is multiplied by one of the two variants of $\alpha(n)$
determined by either Equation 14 (piece-wise linear MAP) or
Equation 15 (exponential MAP) to get $\alpha(n) m_0$. The ML
estimate of the utterance mean ($m_n$) from the data observed
up to time n is approximated by

$$m_n = \frac{1}{n}\sum_{i=1}^{n} c_i$$

That is, at time n, $m_n$ is calculated using those vectors
observed up to time n only. This $m_n$ is then multiplied by
$1 - \alpha(n)$ at 18, where $\alpha(n)$ again comes from either
Equation 14 (piece-wise linear MAP) or Equation 15
(exponential MAP). The $\alpha(n) m_0$ output from multiplier 16
and the $(1 - \alpha(n)) m_n$ output from multiplier 18 are summed
at summer 19 to get the log-spectral mean
$\bar{c}_n = m_{MAP}(n)$. The input signals are sampled,
windowed, and transformed to produce the n-th log-spectral
vector $c_n$. The log-spectral mean $m_{MAP}(n)$ up to frame n is
subtracted from the log-spectral vector $c_n$ to get the
mean-subtracted vector $\hat{c}_n$, which is applied to the
recognizer 11.

The validation of the techniques is based on a 7-10
connected telephone digit recognition task.

We use an 8 kHz sampling rate and a 20 ms frame rate with
pre-emphasis. Observation vectors are 10th-order Linear
Predictive Coding (LPC) derived 13 Mel Frequency Cepstral
Coefficients (MFCC) with their regression-based first
order time derivatives. Acoustic models are phonetic
Gaussian Mixture Hidden Markov Models (GMHMM). The
phone models are word-dependent. There are 47 models for
digit recognition. The HMMs have on average about 5
states.
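
The three weighting schedules for $\alpha(n)$ discussed above (Equations 12, 14, and 15) can be tabulated with the short sketch below. The function names are hypothetical. The linear and exponential schedules are written in unrolled form, obtained by applying the recursions of Equations 14 and 15 n times starting from $\alpha(0) = 1$. The default parameter values are the best settings reported in Table 2, used here only as illustrative defaults.

```python
def alpha_rho(n, rho=35.0):
    """Equation 12: weight on the prior mean after n frames."""
    return rho / (rho + n)

def alpha_linear(n, frame_ms=20.0, zero_ms=2300.0):
    """Equation 14 unrolled: drops by D/Z per frame, then stays at 0."""
    return max(1.0 - n * frame_ms / zero_ms, 0.0)

def alpha_exponential(n, gamma=0.985):
    """Equation 15 unrolled: multiplied by gamma once per frame."""
    return gamma ** n

# Tabulate the schedules at a few frame indices.
for n in (0, 10, 50, 115, 200):
    print(n, alpha_rho(n), alpha_linear(n), alpha_exponential(n))
```

All three start at 1 for n = 0 and decrease with n, but only the piece-wise linear schedule reaches exactly 0, after which the estimate depends on the observed frames alone.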

Speech data are from the MACROPHONE database (J.
Bernstein et al., "Macrophone: An American Telephone
Speech Corpus for the Polyphone Project," Proc. of IEEE
Internat. Conf. on Acoustics, Speech and Signal Processing,
volume 1, pages 81-84, Adelaide, Australia, April 1994). To
reduce training time, we used a subset of 1844 utterances
from the training data set. The test was performed on the
standard test data set containing about 1200 utterances.

Throughout the experiments, we used on average 4 Gaussians
per state, which gives a WER of 1.57% with cepstral
mean normalization (S. Furui, "Cepstral Analysis Technique
for Automatic Speaker Verification," IEEE Trans. Acoust.,
Speech and Signal Processing, ASSP-29(2): 254-272, 1981)
and 2.15% without.

For MAP estimation, the prior mean vector ($m_0$) is
computed as the average of all cepstral vectors in the
training subset.

In all figures showing WER curves below, three curves are
displayed:

1. Training is performed with standard utterance-based 
CMN and test is performed with sequential mean 
estimation. The curve label starts with T-. 

2. Training and testing all use sequential mean estimation. 

3. Training and testing all use standard utterance-based 
CMN. The curve label starts with CM-. 

TABLE 1

Description of result tables

Description              Equation          Figure
Exponential smoothing    Equations 2, 6    FIG. 4
$\rho$-controlled MAP    Equation 11       FIG. 5
Exponential MAP          Equation 15       FIG. 6
Piece-wise linear MAP    Equation 14       FIG. 7

TABLE 2

Lowest WER for each mean normalization technique

Description                Equation       WER (%)    Parameter
Cumulative mean removal    Equation 5     1.74       N.A.
Exponential smoothing      Equation 2     1.69       $\alpha$ = 0.98
$\rho$-controlled MAP      Equation 11    1.61       $\rho$ = 35
Exponential MAP            Equation 15    1.60       $\gamma$ = 0.985
Piece-wise linear MAP      Equation 14    1.57       Z = 2300
CMN                        Equation 1     1.57       N.A.

Table 2 compares the lowest WER for each technique.

To do a cross-database validation, 1390 10-digit utterances
from another database were recognized using the
above CMN models and the parameter settings in Table 2,
for cumulative mean removal, $\rho$-controlled MAP,
piece-wise linear MAP, and CMN. The results are shown in
Table 3.

TABLE 3

WER for each mean normalization technique
on another telephone speech database

Description                Equation       WER
Cumulative mean removal    Equation 5     153
$\rho$-controlled MAP      Equation 11    2.39
Piece-wise linear MAP      Equation 14    2.38
CMN                        Equation 1     2.09

The content of the figures below is summarized in Table
1. Results for cumulative mean removal (Equation 5) are
also shown in these figures as special points: $\rho$=0 for the
$\rho$-controlled MAP (Equation 11); $\gamma$=0 for the exponential
MAP (Equation 15); and Z=0 for the piece-wise linear MAP
(Equation 14). The best WER for cumulative mean removal
is 1.74 (corresponding to $\rho$=0 in the MAP estimator of
Equation 11).



From the results we observe the following:

1. Among the functional forms for $\alpha(n)$ that were tested,
the piece-wise linear approximation of MAP gives the best
results. The optimal zero-crossing point for the linear
decaying function is 2300 ms. Using a larger zero-crossing
point will not help the WER because the prior
mean $m_0$ will prevent the estimated mean from becoming
utterance specific. It was reported (C. Mokbel, D. Jouvet
and J. Monné, "Deconvolution of Telephone Line Effects
for Speech Recognition," Speech Communication, 19(3):
185-196, 1996) that averaging cepstral vectors on a few
seconds of speech produces a reliable estimate of the
constant convolved perturbation.

2. All three MAP-based techniques give noticeably better
results than the two smoothing techniques. This shows
that, at the beginning of an utterance, using the prior mean
gives a better estimate of the utterance mean.

3. Both tested generalized MAP variants give better results
than the traditional MAP estimation.

4. When sequential mean removal is used in both training
and testing, the WER as a function of control parameters
is irregular. This is probably due to the limited number
(1844) of utterances for training.

5. Training with CMN and testing with sequential mean
removal gives lower WER than when training and testing
both employ sequential mean removal.

6. Utterance-based CMN always gives better results than
sequential mean removal.

Experiments show that MAP with piece-wise linear
approximation, which does not require any look-ahead and
thus can operate in real time, gives the lowest WER among all
tested sequential mean removal techniques and performs as
well as whole-utterance-based mean removal.
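
A minimal sketch of sequential mean removal with the piece-wise linear MAP weighting (Equations 13 and 14, with the ML mean of Equation 5) is given below. It is an illustration under stated assumptions, not the applicants' implementation: the function name is hypothetical, the first frame is taken to be n = 1, and the defaults of 20 ms per frame and a 2300 ms zero-crossing point are taken from the experiments only as example values.

```python
import numpy as np

def map_mean_removal(vectors, prior_mean, frame_ms=20.0, zero_ms=2300.0):
    """Piece-wise linear MAP mean subtraction, one frame at a time.

    Uses only frames observed so far, so no look-ahead is needed.
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    running_sum = np.zeros_like(prior_mean)
    alpha = 1.0                                   # alpha(0) = 1
    out = []
    for n, c in enumerate(np.asarray(vectors, dtype=float), start=1):
        alpha = max(alpha - frame_ms / zero_ms, 0.0)    # Equation 14
        running_sum += c
        ml_mean = running_sum / n                       # Equation 5
        # Equation 13: weighted sum of prior mean and ML mean.
        map_mean = alpha * prior_mean + (1.0 - alpha) * ml_mean
        out.append(c - map_mean)                        # mean-subtracted vector
    return np.array(out)
```

Early frames are normalized mostly by the prior mean; once the weight has decayed to 0 the output is identical to cumulative mean removal over the frames seen so far.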

What is claimed is: 

1. A speech recognition system comprising: 
a recognizer; 

a generalized maximum a posteriori estimator for deter- 
mining utterance log-spectral mean; 

means for subtracting said utterance log-spectral mean 
from log-spectral vector of the incoming speech signal; 
and 

means for coupling said means for subtracting to the input
of said recognizer for providing mean subtracted vector
of the input signal to said recognizer.

2. The recognition system of claim 1 wherein speech 
recognition models are also applied to said recognizer. 

3. The recognition system of claim 2 wherein said speech 
models are HMM models. 

4. The recognition system of claim 1 wherein said maximum
a posteriori estimator follows the following equation
of:

$$m_{MAP}(n) = \alpha(n)\, m_0 + (1 - \alpha(n))\, m_n$$

where $m_0$ is mean of training data, $m_n$ is the ML estimate of
the utterance mean from the data observed up to time n and

$$m_n = \frac{1}{n}\sum_{i=1}^{n} c_i$$

and $\alpha(n)$ is piece-wise linear MAP where

$$\alpha(n) = \begin{cases} 1, & \text{if } n = 0; \\ \max\left(\alpha(n-1) - \dfrac{D}{Z},\ 0\right), & \text{otherwise,} \end{cases}$$

where D is the time interval between two frames (frame rate)
and Z is the frame where $\alpha(n)$ goes to and stays at 0.

5. The recognition system of claim 1 wherein the said
generalized maximum a posteriori estimator follows the
following equation of

$$m_{MAP}(n) = \alpha(n)\, m_0 + (1 - \alpha(n))\, m_n$$

where $m_0$ is mean of training data, $m_n$ is the ML estimate of
the utterance mean from the data observed up to time n and

$$m_n = \frac{1}{n}\sum_{i=1}^{n} c_i$$

and $\alpha(n)$ is exponential decaying where

$$\alpha(n) = \begin{cases} 1, & \text{if } n = 0; \\ \alpha(n-1) \times \gamma, & \text{otherwise,} \end{cases}$$

where $0 < \gamma < 1$ controls the exponential decrease.